# Mapping-aware Logic Synthesis with Parallelized Stochastic Optimization

# Zhiru Zhang School of ECE, Cornell University

September 29, 2017 @ EPFL





Cornell University

# **A Case Study on Digit Recognition**



# LUT-based Technology Mapping

- Look-up tables (LUTs): the core building blocks of FPGAs
  - A k-LUT can implement any K-input 1-output Boolean function, or any k-feasible cone in the logic network



Delay estimation in HLS is inaccurate without considering LUT mapping

# **Scheduling and Mapping Interdependence**



#### **Mapping-Aware Scheduling** [FPGA'15]





M. Tan, S. Dai, U. Gupta, Z. Zhang, Mapping-Aware Constrained Scheduling for LUT-Based FPGAs, FPGA'2015.

# What about Post-RTL Synthesis?

Case 1: Min-area mapping without logic restructuring

– Already NP-hard <sup>[1]</sup>

Case 2: with logic restructuring

- Even harder to find optimal solution
- Existing approach: heuristically transform logic network for better mapping quality

Focus of this talk



# **Typical Pre-mapping Transformation Sequence**

• A typical area-minimizing script in ABC:



A predetermined technology-independent optimization sequence

# **Autotuning Logic Synthesis Tool**

Applying DATuner, a distributed autotuning framework, to auto determine the logic transformation sequence



github.com/cornell-zhang/datuner

C. Xu, G. Liu, R. Zhao, S. Yang, G. Luo, and Z. Zhang, **A Parallel Bandit-Based Approach for Autotuning** 7 **FPGA Compilation**, FPGA'2017.

# **DATuner: Dynamic Solution Space Partitioning**

- Separating promising from unpromising subspaces
  - Guided by information gain derived from QoR of known samples



#### **Autotuning vs. ABC: Unconstrained Area Minimization**



- ABC Optimized: designs optimized with *compress2rs* script
- DATuner: a budget of 128 ABC runs across 4 machines
- EPFL benchmarks: <u>http://lsi.epfl.ch/benchmarks</u>

#### Autotuning vs. Best Known Records (v2017.1)



- ABC Optimized: designs optimized with *compress2rs* script
- DATuner: a budget of 128 ABC runs across 4 machines
- EPFL benchmarks: <u>http://lsi.epfl.ch/benchmarks</u>
  - Best known results: from EPFL record, version 2017.1

#### **PIMap: Parallelized Mapping-Aware Logic Synthesis** [FPGA'17]

- Mapping-guided logic transformations
  - Iteratively improve area



- Effective partitioning and parallelization technique
  - Improve both runtime and design quality



G. Liu and Z. Zhang, A Parallelized Iterative Improvement Approach to Area Optimization for LUT-Based 11 Technology Mapping, FPGA'2017.

<u>Use mapping result to guide randomly</u> proposed logic transformations

# Use mapping result to guide randomly proposed logic transformations





# Use mapping result to guide randomly proposed logic transformations







# Use mapping result to guide randomly proposed logic transformations





[1] Hastings, Biometrika'70

# Use mapping result to guide randomly proposed logic transformations



# Use mapping result to guide randomly proposed logic transformations



[1] Hastings, Biometrika'70



# **Need for Partitioning**

- Without partitioning
  - Long runtime per trial
  - Easily stuck at local minimum



----No partition

#### Initial mapping to LUT Subgraph extraction Iterative area minimization Recombine subgraphs











Initial mapping to LUT Subgraph extraction Iterative area minimization Recombine subgraphs



Initial mapping to LUT Subgraph extraction Iterative area minimization Recombine subgraphs

14 LUTs





15 LUTs

Initial mapping to LUT Subgraph extraction Iterative area minimization Recombine subgraphs



Initial mapping to LUT Subgraph extraction Iterative area minimization Recombine subgraphs



# **PIMap Technique: Repartition**

Initial mapping to LUT Subgraph extraction Iterative area minimization Recombine subgraphs



Repartition using different seeds

# **PIMap Technique: Repartition**

Initial mapping to LUT Subgraph extraction Iterative area minimization Recombine subgraphs



# **PIMap Technique: Repartition**

Initial mapping to LUT Subgraph extraction Iterative area minimization Recombine subgraphs



# **Partitioning Schemes**

- Without partitioning
  - Long runtime per trial
  - Easily stuck in local optima
- Fine-grained partition
  - Bear a similarity to exact synthesis
  - Fast runtime per trial
  - But slow progress overall



# **Partitioning Schemes**

- Without partitioning
  - Long runtime per trial
  - Easily stuck in local optima
- Fine-grained partition
  - Bear a similarity to exact synthesis
  - Fast runtime per trial
  - But slow progress overall

#### Coarse-grained partition

- Balance runtime and solution quality
- Repartition between trials to further improve quality



# **PIMap Overall Flow**

# Design C1908 from the MCNC benchmark suite 5 trials in total



#### **Initial Design**

#### Observations:

- 1. Partition boundaries vary between trials
  - → Uncover better structure
- 2. Overall network structure differ significantly between trials
  - Discover a wide range of designs

# **Experimental Setup**

| PIMap toolchain                           |                                                                      | Benchmarks                                   |                                                             |
|-------------------------------------------|----------------------------------------------------------------------|----------------------------------------------|-------------------------------------------------------------|
|                                           | ABC's logic<br>transformations:<br><i>balance, rewrite, refactor</i> | Benchmark                                    | Initial design                                              |
| ABC's tech<br>mapper                      |                                                                      | 10 largest<br>MCNC<br>designs <sup>[1]</sup> | pre-synthesized using<br>ABC's <i>compress2rs</i><br>script |
| Iterative area<br>minimization<br>routine | Subgraph extraction<br>and parallelization<br>control                | EPFL<br>arithmetic<br>designs <sup>[2]</sup> | best-known mapping<br>designs <sup>[2]</sup>                |

[1] Yang, MCNC'91

[2] Amarù, et al., http://lsi.epfl.ch/benchmarks

Setup

| Configuration      | 40 trials, 100 iterations of area minimization per trial |
|--------------------|----------------------------------------------------------|
| Partitioning       | up to 16 subgraphs, each with up to 100 LUTs             |
| Computing resource | up to 8 machines, each with a quad-core Xeon processor   |

# **Area Minimization Results**



Initial design: <u>best-known results</u> from EPFL record

# **Area Minimization Results**



■ Initial design ■ 5 trials ■ 10 trials ■ 40 trials

- Initial design: <u>best-known results</u> from EPFL record
- Area improvements
  - EPFL: 7% on average, up to 14%

# **Area Minimization Results**



■ Initial design ■ 5 trials ■ 10 trials ■ 40 trials

- Initial design: <u>best-known results</u> from EPFL record
- Area improvements
  - EPFL: 7% on average, up to 14%
  - Can effectively handle large circuit (~44k LUTs)
- Also able to improve all 10 control-intensive designs in EPFL benchmark suite

## LUT Count vs. Gate Count Reduction

*Verified:* **post-mapping** area **does not necessarily** correlate with **pre-mapping** area



### **Partition Granularity vs. Runtime**

Trade-off between runtime vs. progress per trial

Optimal subgraph size is around 100 LUTs



# **Depth Constrained Area Minimization on MCNC**

- Constraint: no depth increase compared to initial design
  - Initial designs generated by ABC's depth-minimizing resyn2 script
- Area improvements under depth constraint for MCNC benchmarks





#### **Extending PIMap to Approximate Logic Synthesis**



#### Statistically Certified Approximate Synthesis [ICCAD'17]



- Previous approaches (sample-based testing):
  - Randomly pick N input vectors, then simulate, error rate =  $N_{incorrect}/N$
- Our approach: Formally quantify errors using hypothesis testing

| Testing different types of error me |
|-------------------------------------|
|-------------------------------------|

| Error metric            | Test target       | Test statistic |
|-------------------------|-------------------|----------------|
| Error rate              | Sample occurrence | Binomial test  |
| Average error magnitude | Sample mean       | T-test         |
| Error variance          | Sample variance   | $\chi^2$ -test |

# **PIMap Summary**

- Current logic synthesis flow still leaves nontrivial room for area improvement (up to 30%)
- Parallelized stochastic optimization is an effective approach for technology-aware synthesis
- Similar opportunities exist in RTL and high-level synthesis



Chortle: Francis, et al., DAC'90 DAGMap: Chen, et al., DT'92 FlowMap: Cong and Ding, TCAD'94 CutMap: Cong and Hwang, FPGA'95 DAOMap: Chen and Cong, ICCAD'04 K and L: Kao and Lai, TDAES'05 Imap: Manohararajah, et al., TCAD'06 ABC Map: Mishchenko, et al., TCAD'07 Exact synthesis: Haaswijk, et al., ASPDAC'17